ROCm: Add gfx950 (MI355X/CDNA4) to is_cdna() and include PR #4021 fixes #4050
GoldenGrapeGentleman wants to merge 18 commits into unslothai:main
Conversation
MI355X (gfx950) has the same 1024-thread workgroup limit as MI300X (gfx942), but was missing from is_cdna(), causing all Triton kernels to use num_warps=32 (2048 threads) instead of 16 (1024 threads), resulting in an OutOfResources crash. Also includes the ROCm GPT-OSS BF16 routing and dequant buffer dtype fixes from PR unslothai#4021 by @danielhanchen, cherry-picked for MI355X validation.

Tested on 8x AMD Instinct MI355X (gfx950), ROCm 7.1:
- Vision RL GRPO (Qwen2.5-VL-7B): 5/5 steps
- Code RL GRPO (gpt-oss-20b BF16): 20/20 steps
- gpt-oss-120b GRPO: 5/5 steps (B200 OOM'd on this)
- MoE expert LoRA + save_pretrained_merged: success
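For reference, a minimal sketch of the check with the new architecture string in place; the body of is_cdna() shown here is an assumption based on the description above, and only the added "gfx950" entry is what this PR actually changes:

```python
# Sketch only -- the real is_cdna() in unsloth/kernels/utils.py may be written differently.
import torch

def is_cdna() -> bool:
    """True on AMD CDNA GPUs (MI300/MI355 class), which cap workgroups at 1024 threads."""
    if torch.version.hip is None or not torch.cuda.is_available():
        return False
    arch = torch.cuda.get_device_properties(0).gcnArchName  # e.g. "gfx950:sramecc+:xnack-"
    # gfx940/941/942 = MI300 series (CDNA3); "gfx950" (MI355X, CDNA4) is the new entry.
    return any(g in arch for g in ("gfx940", "gfx941", "gfx942", "gfx950"))
```

Any ROCm GPU whose reported architecture matches one of these strings then gets the reduced warp count, which is exactly what MI355X was missing.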
Summary of Changes

Hello @GoldenGrapeGentleman, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances ROCm compatibility and stability by extending support to AMD Instinct MI355X (gfx950/CDNA4) GPUs, which directly addresses Triton kernel thread limit issues. It also integrates a suite of stability fixes from a previous pull request, focusing on robust model loading for GPT-OSS on HIP devices, refining dequantization logic, and proactively mitigating potential AITER-related problems on ROCm.
Code Review
This pull request introduces support for the AMD Instinct MI355X (gfx950) GPU and incorporates several stability fixes for ROCm, which is a valuable enhancement. The changes are logical and well-implemented, particularly the safety improvements around dynamic code execution and buffer handling. I've identified a couple of areas with code duplication that could be refactored to improve long-term maintainability. Overall, this is a solid contribution that improves hardware support and the robustness of the library.
    elif DEVICE_TYPE == "hip":
        SUPPORTS_BFLOAT16 = torch.cuda.is_bf16_supported()

        def is_bf16_supported():
            return SUPPORTS_BFLOAT16

    elif DEVICE_TYPE == "xpu":
        # torch.xpu.is_bf16_supported() does not have including_emulation
        # set SUPPORTS_BFLOAT16 as torch.xpu.is_bf16_supported()
        SUPPORTS_BFLOAT16 = torch.xpu.is_bf16_supported()

        def is_bf16_supported():
            return SUPPORTS_BFLOAT16
To improve maintainability and reduce code duplication, you can refactor this logic. The is_bf16_supported function is defined identically for both hip and xpu device types. Combining the elif blocks for hip and xpu and defining the function only once would make the code cleaner.
Suggested change:

    - elif DEVICE_TYPE == "hip":
    -     SUPPORTS_BFLOAT16 = torch.cuda.is_bf16_supported()
    -     def is_bf16_supported():
    -         return SUPPORTS_BFLOAT16
    - elif DEVICE_TYPE == "xpu":
    -     # torch.xpu.is_bf16_supported() does not have including_emulation
    -     # set SUPPORTS_BFLOAT16 as torch.xpu.is_bf16_supported()
    -     SUPPORTS_BFLOAT16 = torch.xpu.is_bf16_supported()
    -     def is_bf16_supported():
    -         return SUPPORTS_BFLOAT16
    + elif DEVICE_TYPE in ("hip", "xpu"):
    +     if DEVICE_TYPE == "hip":
    +         SUPPORTS_BFLOAT16 = torch.cuda.is_bf16_supported()
    +     else:  # xpu
    +         # torch.xpu.is_bf16_supported() does not have including_emulation
    +         # set SUPPORTS_BFLOAT16 as torch.xpu.is_bf16_supported()
    +         SUPPORTS_BFLOAT16 = torch.xpu.is_bf16_supported()
    +     def is_bf16_supported():
    +         return SUPPORTS_BFLOAT16
    (
        model_name,
        load_in_4bit,
        load_in_8bit,
        load_in_fp8,
        load_in_16bit,
        quantization_config,
    ) = _route_hip_gpt_oss_model(
        model_name = model_name,
        use_exact_model_name = use_exact_model_name,
        load_in_4bit = load_in_4bit,
        load_in_8bit = load_in_8bit,
        load_in_fp8 = load_in_fp8,
        load_in_16bit = load_in_16bit,
        quantization_config = quantization_config,
        kwargs = kwargs,
    )
This block of code for routing HIP GPT-OSS models is duplicated in FastModel.from_pretrained at lines 880-896. To improve maintainability and reduce redundancy, consider refactoring this logic into a shared helper method that both FastLanguageModel.from_pretrained and FastModel.from_pretrained can call. This would centralize the model routing logic, making future updates easier.
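A rough sketch of what that shared helper could look like, passing the loader options as a dict so both call sites shrink to a single line; the helper name and the dict-based signature are assumptions for illustration, not the library's actual API:

```python
# Hypothetical refactor sketch -- names are illustrative only.
def _apply_hip_gpt_oss_routing(opts: dict) -> dict:
    """Route HIP GPT-OSS loader options in place and return the same dict.

    Both FastLanguageModel.from_pretrained and FastModel.from_pretrained could call
    this once instead of repeating the six-element tuple unpack shown above.
    """
    (
        opts["model_name"],
        opts["load_in_4bit"],
        opts["load_in_8bit"],
        opts["load_in_fp8"],
        opts["load_in_16bit"],
        opts["quantization_config"],
    ) = _route_hip_gpt_oss_model(
        model_name           = opts["model_name"],
        use_exact_model_name = opts["use_exact_model_name"],
        load_in_4bit         = opts["load_in_4bit"],
        load_in_8bit         = opts["load_in_8bit"],
        load_in_fp8          = opts["load_in_fp8"],
        load_in_16bit        = opts["load_in_16bit"],
        quantization_config  = opts["quantization_config"],
        kwargs               = opts["kwargs"],
    )
    return opts
```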
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 41c5a9639f
    if not lower_model_name.endswith("-bf16"):
        if "120b" in lower_model_name:
            model_name = "unsloth/gpt-oss-120b-BF16"
        else:
            model_name = "unsloth/gpt-oss-20b-BF16"
Restrict HIP GPT-OSS remap to canonical model IDs
_route_hip_gpt_oss_model rewrites matched names to unsloth/gpt-oss-20b-BF16/120b-BF16 based only on substring matching, so on HIP it can replace requested non-base models (e.g. unsloth/gpt-oss-safeguard-20b, which is a valid mapped ID in unsloth/models/mapper.py:1246-1252) and local checkpoint paths that include gpt-oss. In those cases the loader silently fetches different weights than the caller asked for, which can invalidate training/evaluation results.
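One way to address this, sketched under the assumption that the remap only needs to cover the base gpt-oss checkpoints; the allow-list contents, the helper name, and the local-path check are illustrative, not part of this PR:

```python
import os

# Sketch only: restrict the HIP remap to an explicit allow-list of base model IDs.
# The IDs below are illustrative; the real list should mirror unsloth/models/mapper.py.
_HIP_GPT_OSS_REMAP = {
    "openai/gpt-oss-20b":   "unsloth/gpt-oss-20b-BF16",
    "unsloth/gpt-oss-20b":  "unsloth/gpt-oss-20b-BF16",
    "openai/gpt-oss-120b":  "unsloth/gpt-oss-120b-BF16",
    "unsloth/gpt-oss-120b": "unsloth/gpt-oss-120b-BF16",
}

def _remap_gpt_oss_for_hip(model_name: str) -> str:
    # Never rewrite local checkpoint paths, and never rewrite variants such as
    # gpt-oss-safeguard-* that merely contain the "gpt-oss" substring.
    if os.path.exists(model_name):
        return model_name
    return _HIP_GPT_OSS_REMAP.get(model_name.lower(), model_name)
```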
Summary

Add AMD Instinct MI355X (gfx950 / CDNA4) support to is_cdna() and include ROCm stability fixes from PR #4021 by @danielhanchen.

Problem

is_cdna() only listed gfx940/941/942 (MI300 series). MI355X (gfx950, CDNA4) has the same 1024-thread workgroup limit but was missing, causing all Triton kernels to use num_warps=32 (2048 threads) instead of 16 (1024 threads) and crash with OutOfResources. This blocked all training on MI355X.
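The warp-count selection this feeds is roughly the following; this is a sketch of the reasoning only, with the import path taken from the file named below and the surrounding launch code assumed:

```python
# Sketch of the thread-limit logic; the actual Triton launch code in unsloth differs in detail.
from unsloth.kernels.utils import is_cdna  # the check this PR extends with "gfx950"

def pick_num_warps() -> int:
    # A Triton "warp" on AMD CDNA is a 64-lane wavefront and CDNA workgroups are capped
    # at 1024 threads, so 16 warps * 64 lanes = 1024 is the largest legal request.
    # With gfx950 unrecognized, the generic value of 32 warps (2048 threads) was used,
    # which triggers Triton's OutOfResources error at kernel launch.
    return 16 if is_cdna() else 32
```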
Changes

- Add "gfx950" to is_cdna() in unsloth/kernels/utils.py (+1 line)

Verified on 8× AMD Instinct MI355X (gfx950), ROCm 7.1.
cc @danielhanchen — this includes your PR #4021 changes, cherry-picked and validated on MI355X. The is_cdna() fix is the additional piece needed for CDNA4.